Semantic clustering: Identifying topics in source code
Abstract
Many of the existing approaches in Software Comprehension focus on program structure or external documentation. However, by analyzing only formal information, they overlook the informal semantics contained in the vocabulary of source code. To understand software as a whole, we need to enrich software analysis with the developer knowledge hidden in the naming used in the code. This paper proposes the use of information retrieval to exploit linguistic information found in source code, such as identifier names and comments. We introduce Semantic Clustering, a technique based on Latent Semantic Indexing and clustering to group source artifacts that use similar vocabulary. We call these groups semantic clusters and we interpret them as linguistic topics that reveal the intention of the code. We compare the topics to each other, identify links between them, provide automatically retrieved labels, and use a visualization to illustrate how they are distributed over the system. Our approach is language independent as it works at the level of identifier names. To validate our approach we applied it to several case studies, two of which we present in this paper. Note: Some of the visualizations presented make heavy use of colors. Please obtain a color copy of the article for better understanding.
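The idea of grouping source artifacts by shared vocabulary can be illustrated with a minimal sketch. It is not the paper's implementation: it skips the Latent Semantic Indexing step and compares raw term-frequency vectors with cosine similarity, then groups artifacts with a simple single-link rule. The function names, the 0.3 threshold, and the three toy artifacts are all assumptions made for illustration.

```python
import re
from collections import Counter
from math import sqrt


def split_identifier(name):
    """Split a camelCase or snake_case identifier into lowercase terms."""
    parts = re.findall(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+", name)
    return [p.lower() for p in parts]


def vocabulary(source):
    """Count the terms found in every word-like token of a source text."""
    terms = Counter()
    for token in re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source):
        terms.update(split_identifier(token))
    return terms


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs, threshold=0.3):
    """Single-link grouping: two artifacts share a cluster when a chain of
    pairwise similarities at or above the (assumed) threshold connects them."""
    names = list(docs)
    vectors = {n: vocabulary(docs[n]) for n in names}
    parent = {n: n for n in names}

    def find(n):
        while parent[n] != n:
            n = parent[n]
        return n

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if cosine(vectors[a], vectors[b]) >= threshold:
                parent[find(b)] = find(a)

    groups = {}
    for n in names:
        groups.setdefault(find(n), []).append(n)
    return sorted(groups.values())


# Hypothetical toy artifacts, used only to exercise the sketch.
docs = {
    "HttpClient": "class HttpClient { sendRequest(url); parseResponse(response); }",
    "HttpServer": "class HttpServer { handleRequest(request); sendResponse(response); }",
    "InvoicePrinter": "class InvoicePrinter { printInvoice(invoice); formatTotal(amount); }",
}

print(cluster(docs))  # [['HttpClient', 'HttpServer'], ['InvoicePrinter']]
```

Because the sketch only looks at word-like tokens, it does not depend on the programming language of the artifacts. A fuller version would likely filter language keywords such as `class` and reduce the term space (for example with LSI) before clustering.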
Similar resources
The Impact of Semantic Clustering on Iranian EFL Advanced Learners’ Vocabulary Retention
This study investigated the impact of semantic clustering on Iranian EFL learners’ vocabulary retention at advanced level. Participants were female learners randomly assigned to two groups of 15. Four instruments (TOEFL test; vocabulary pretest; immediate posttest, and delayed recall posttest) were used. The experimental group underwent semantic clustering vocabulary presentation in which the l...
On the Effect of Semantically Enriched Context Models on Software Modularization
Many of the existing approaches for program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for modularization of the system that relies on the informal semantics of the program, encoded in the vocabulary used in the source code. Treating the source code as a collection of tokens loses the se...
Word clustering effect on vocabulary learning of EFL learners: A case of semantic versus phonological clustering
The aim of this study is to determine the effect of word clustering method on vocabulary learning of Iranian EFL learners through a case of semantic versus phonological clustering. To this effect, 80 homogeneous students from four intermediate classes at an English institute in Torbat e Heydariyeh participated in this research. They were assigned to four groups according to semantic versus phon...
Clustering Class Diagram through Mining
A class diagram models the static view of a system. Class diagrams are widely used when constructing executable code for software applications, as the class diagram is the only UML diagram that can be mapped directly to an object-oriented language. Because class diagrams contain duplication, the redundant source code generated from them increases the complexity of the program code. A solution is required to remove the ...
Semantic Clustering: exploiting Linguistic Information
Many approaches have been developed to comprehend software source code, most of them focusing on structural information about the program. However, in doing so we miss crucial information, namely the domain semantics contained in the text or symbols of the source code. When we are to understand software as a whole, we need to enrich these approaches with conceptual insights gained f...
Journal:
- Information & Software Technology
Volume 49
Publication date: 2007